18 research outputs found
Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing
The development of distributed training strategies for statistical prediction functions
is important for applications of machine learning, generally, and the development
of distributed structured prediction training strategies is important for natural
language processing (NLP), in particular. With ever-growing data sets, this is, first, because
it is easier to increase computational capacity by adding more processor nodes
than it is to increase the power of individual processor nodes, and, second, because
data sets are often collected and stored in different locations.
Iterative parameter mixing (IPM) is a distributed training strategy in which each
node in a network of processors optimizes a regularized average loss objective on its
own subset of the total available training data, making stochastic (per-example) updates
to its own estimate of the optimal weight vector, and communicating with the
other nodes by periodically averaging estimates of the optimal vector across the network.
This algorithm has been contrasted with a close relative, called here the single-mixture
optimization algorithm, in which each node stochastically optimizes an average
loss objective on its own subset of the training data, operating in isolation until
convergence, at which point the average of the independently created estimates is returned.
Recent empirical results have suggested that this IPM strategy produces better
models than the single-mixture algorithm, and the results of this thesis add to this
picture.
The contributions of this thesis are as follows.
The first contribution is to produce and analyze an algorithm for decentralized
stochastic optimization of regularized average loss objective functions. This algorithm,
which we call the distributed regularized dual averaging algorithm, improves over
prior work on distributed dual averaging by providing a simpler algorithm (used in the
rest of the thesis), better convergence bounds for the case of regularized average loss
functions, and certain technical results that are used in the sequel.
The central contribution of this thesis is to give an optimization-theoretic justification
for the IPM algorithm. While past work has focused primarily on its empirical
test-time performance, we give a novel perspective on this algorithm by showing that,
in the context of the distributed dual averaging algorithm, IPM constitutes a convergent
optimization algorithm for arbitrary convex functions, while the single-mixture
distribution algorithm is not. Experiments indeed confirm that the superior test-time
performance of models trained using IPM, compared to single-mixture, correlates with
better optimization of the objective value on the training set, a fact not previously reported.
Furthermore, our analysis of general non-smooth functions justifies the use of
distributed large-margin (support vector machine [SVM]) training of structured predictors,
which we show yields better test performance than the IPM perceptron algorithm,
the only version of the IPM to have previously been given a theoretical justification.
Our results confirm that IPM training can reach the same level of test performance
as a sequentially trained model and can reach better accuracies when one has a fixed
budget of training time.
Finally, we use the reduction in training time that distributed training allows to experiment
with adding higher-order dependency features to a state-of-the-art phrase-structure
parsing model. We demonstrate that adding these features improves out-of-domain
parsing results of even the strongest phrase-structure parsing models, yielding
a new state-of-the-art for the popular train-test pairs considered. In addition, we show
that a feature-bagging strategy, in which component models are trained separately and
later combined, is sometimes necessary to avoid feature under-training and get the best
performance out of large feature sets.
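The two strategies contrasted in the abstract above, iterative parameter mixing (IPM) and single-mixture training, can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a binary L2-regularized hinge loss instead of a structured predictor, plain stochastic subgradient steps instead of dual averaging, mixing once per epoch, and a fixed number of epochs as a stand-in for "until convergence". All function names, the toy data, and the hyperparameters are hypothetical.

```python
import random

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def sgd_epoch(w, shard, lam, lr, rng):
    """One pass of per-example subgradient steps on the L2-regularized hinge loss."""
    w = list(w)  # work on a copy so the caller's vector is untouched
    order = list(range(len(shard)))
    rng.shuffle(order)
    for i in order:
        x, y = shard[i]
        active = y * dot(w, x) < 1  # hinge term contributes a subgradient
        for j in range(len(w)):
            grad = lam * w[j] - (y * x[j] if active else 0.0)
            w[j] -= lr * grad
    return w

def average(vectors):
    """Coordinate-wise mean of a list of weight vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def iterative_parameter_mixing(shards, dim, epochs, lam=0.01, lr=0.1, seed=0):
    """Each node trains on its shard from the shared vector; average after every epoch."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        local = [sgd_epoch(w, shard, lam, lr, rng) for shard in shards]
        w = average(local)  # the periodic mixing step
    return w

def single_mixture(shards, dim, epochs, lam=0.01, lr=0.1, seed=0):
    """Each node trains in isolation; the estimates are averaged once at the end."""
    rng = random.Random(seed)
    local = [[0.0] * dim for _ in shards]
    for _ in range(epochs):
        local = [sgd_epoch(w, s, lam, lr, rng) for w, s in zip(local, shards)]
    return average(local)

def objective(w, data, lam):
    """Regularized average hinge loss over the full training set."""
    hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in data) / len(data)
    return 0.5 * lam * dot(w, w) + hinge

def make_shards(n_shards=4, per_shard=50, seed=1):
    """Hypothetical toy data: label is the sign of x0 + x1."""
    rng = random.Random(seed)
    shards = []
    for _ in range(n_shards):
        shard = []
        for _ in range(per_shard):
            x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            shard.append((x, 1 if x[0] + x[1] > 0 else -1))
        shards.append(shard)
    return shards

shards = make_shards()
data = [ex for shard in shards for ex in shard]
w_ipm = iterative_parameter_mixing(shards, dim=2, epochs=10)
w_single = single_mixture(shards, dim=2, epochs=10)
print("IPM objective:           ", objective(w_ipm, data, 0.01))
print("single-mixture objective:", objective(w_single, data, 0.01))
```

The only structural difference between the two functions is where the averaging happens: inside the epoch loop for IPM, once after it for single-mixture. On a toy problem this small the two objective values may be close, so the sketch shows the mechanics only and does not reproduce the thesis's results.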
Analysis of shared common genetic risk between amyotrophic lateral sclerosis and epilepsy
Because hyper-excitability has been shown to be a shared pathophysiological mechanism, we used the latest and largest genome-wide studies in amyotrophic lateral sclerosis (n = 36,052) and epilepsy (n = 38,349) to determine genetic overlap between these conditions. First, we showed no significant genetic correlation, also when binned on minor allele frequency. Second, we confirmed the absence of polygenic overlap using genomic risk score analysis. Finally, we did not identify pleiotropic variants in meta-analyses of the 2 diseases. Our findings indicate that amyotrophic lateral sclerosis and epilepsy do not share common genetic risk, showing that hyper-excitability in both disorders has distinct origins.
Genomic Relationships, Novel Loci, and Pleiotropic Mechanisms across Eight Psychiatric Disorders
Genetic influences on psychiatric disorders transcend diagnostic boundaries, suggesting substantial pleiotropy of contributing loci. However, the nature and mechanisms of these pleiotropic effects remain unclear. We performed analyses of 232,964 cases and 494,162 controls from genome-wide studies of anorexia nervosa, attention-deficit/hyperactivity disorder, autism spectrum disorder, bipolar disorder, major depression, obsessive-compulsive disorder, schizophrenia, and Tourette syndrome. Genetic correlation analyses revealed a meaningful structure within the eight disorders, identifying three groups of inter-related disorders. Meta-analysis across these eight disorders detected 109 loci associated with at least two psychiatric disorders, including 23 loci with pleiotropic effects on four or more disorders and 11 loci with antagonistic effects on multiple disorders. The pleiotropic loci are located within genes that show heightened expression in the brain throughout the lifespan, beginning prenatally in the second trimester, and play prominent roles in neurodevelopmental processes. These findings have important implications for psychiatric nosology, drug development, and risk prediction.
Finishing the euchromatic sequence of the human genome
The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ∼99% of the euchromatic genome and is accurate to an error rate of ∼1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
Genomic reconstruction of the SARS-CoV-2 epidemic in England.
The evolution of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus leads to new variants that warrant timely epidemiological characterization. Here we use the dense genomic surveillance data generated by the COVID-19 Genomics UK Consortium to reconstruct the dynamics of 71 different lineages in each of 315 English local authorities between September 2020 and June 2021. This analysis reveals a series of subepidemics that peaked in early autumn 2020, followed by a jump in transmissibility of the B.1.1.7/Alpha lineage. The Alpha variant grew when other lineages declined during the second national lockdown and regionally tiered restrictions between November and December 2020. A third more stringent national lockdown suppressed the Alpha variant and eliminated nearly all other lineages in early 2021. Yet a series of variants (most of which contained the spike E484K mutation) defied these trends and persisted at moderately increasing proportions. However, by accounting for sustained introductions, we found that the transmissibility of these variants is unlikely to have exceeded the transmissibility of the Alpha variant. Finally, B.1.617.2/Delta was repeatedly introduced in England and grew rapidly in early summer 2021, constituting approximately 98% of sampled SARS-CoV-2 genomes on 26 June 2021.
Logic and the comprehension of language
This thesis examines what is necessary to formally model a hearer's comprehension of a natural language sentence. Our theory of comprehension should at least explain how different words within the same grammatical class make different contributions to the meaning of a sentence. And, our theory should explain how the "full propositional form" that a speaker communicates is recovered from the relatively semantically underspecified acoustic signal. A model is provided which achieves this. A speaker is said to understand an utterance by, first, choosing the maximally "relevant" full propositional semantic enrichment of the underspecified acoustic signal, measured according to a formally defined comparison operator, and, then, computing the inferences that follow from that chosen propositional form in conjunction with their individual word-/world-knowledge. This model of comprehension apparently makes comprehension relative to an individual's idiosyncratic knowledge. So, I also discuss how conventionalized word-meanings co-ordinate individuals' knowledges to allow successful interpersonal communication.
Shared genetic basis between genetic generalized epilepsy and background electroencephalographic oscillations
Objective: Paroxysmal epileptiform abnormalities on electroencephalography (EEG) are the hallmark of epilepsies, but it is uncertain to what extent epilepsy and background EEG oscillations share neurobiological underpinnings. Here, we aimed to assess the genetic correlation between epilepsy and background EEG oscillations. Methods: Confounding factors, including the heterogeneous etiology of epilepsies and medication effects, hamper studies on background brain activity in people with epilepsy. To overcome this limitation, we compared genetic data from a genome-wide association study (GWAS) on epilepsy (n = 12 803 people with epilepsy and 24 218 controls) with that from a GWAS on background EEG (n = 8425 subjects without epilepsy), in which background EEG oscillation power was quantified in four different frequency bands: alpha, beta, delta, and theta. We replicated our findings in an independent epilepsy replication dataset (n = 4851 people with epilepsy and 20 428 controls). To assess the genetic overlap between these phenotypes, we performed genetic correlation analyses using linkage disequilibrium score regression, polygenic risk scores, and Mendelian randomization analyses. Results: Our analyses show strong genetic correlations of genetic generalized epilepsy (GGE) with background EEG oscillations, primarily in the beta frequency band. Furthermore, we show that subjects with higher beta and theta polygenic risk scores have a significantly higher risk of having generalized epilepsy. Mendelian randomization analyses suggest a causal effect of GGE genetic liability on beta oscillations. Significance: Our results point to shared biological mechanisms underlying background EEG oscillations and the susceptibility for GGE, opening avenues to investigate the clinical utility of background EEG oscillations in the diagnostic workup of epilepsy.